All Questions
58 questions
0 votes
2 answers
71 views
How to manage large datasets (approx 95GB)
I was planning some data analysis on a dataset I'll be using for some projects. The dataset in question is ZINC20. Now, I don't need the whole thing so I was going to write some functions that would ...
2 votes
1 answer
737 views
The single CSV created by combining a large number of CSV files is too large to process. What options do I have?
The dataset I am currently working on has more than 100 CSV files, each larger than 250 MB. These files contain time series data captured from different locations and all the files ...
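One common option for this kind of problem is to avoid building the combined file at all and stream the individual files instead. The following is only a minimal sketch under stated assumptions: the function name `streaming_mean` and the column name `value` are hypothetical, and the excerpt does not say what the files actually contain.

```python
import csv
from pathlib import Path
from typing import Iterable


def streaming_mean(paths: Iterable[Path], column: str) -> float:
    """Compute the mean of one numeric column across many CSV files.

    Rows are read one at a time, so memory use stays roughly constant
    no matter how many files there are or how large each one is.
    The column name is an assumption made for illustration.
    """
    total = 0.0
    count = 0
    for path in paths:
        # newline="" is the mode the csv module documentation recommends.
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                total += float(row[column])
                count += 1
    if count == 0:
        raise ValueError("no rows found in the given files")
    return total / count
```

Usage would look like `streaming_mean(sorted(Path("data").glob("*.csv")), "value")`, where the `data` directory is again hypothetical. The same pattern extends to any aggregate that can be updated row by row, such as counts, sums, minima, and maxima.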
2 votes
1 answer
65 views
Best way to preprocess data
I need to create a machine learning model to predict whether a structure is a hotel or an apartment. I have a dataset structured as follows: ...
1 vote
1 answer
290 views
How do I create a dataset from many CSV files that is too large for RAM
I have been handed about 40 GB of CSV files that I need to turn into a database. The files are arranged in a file structure that uses location in that file structure to create a relationship between ...
0 votes
1 answer
519 views
How to do EDA on large datasets
I have a table in Postgres with ~5 million records. When I load the dataset using pandas to perform EDA, I run out of memory. ...
1 vote
1 answer
43 views
Where can I find a dataset that contains criminal case sentencing data? [closed]
I would like to study a dataset where each record represents a criminal case in the US and contains attributes such as: Type of crime, Defendant Age/Sex/Race, Plea, Verdict, Sentence. Is there a dataset ...
2 votes
1 answer
121 views
What is the difference between Pachyderm and Git?
I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that: It holds all your data in a central accessible ...
2 votes
1 answer
223 views
Size of datasets over years
I am looking for statistics on how the size of (public) datasets has evolved over the years. I just found the following statistics: The poll of KDnuggets that actually shows that over ...
3 votes
1 answer
3k views
How does skewed data affect deep neural networks?
I'm playing around with deep neural networks for a regression problem. The dataset I have is right-skewed, and for a linear regression model I would typically perform a log transform. Should I be ...
1 vote
1 answer
344 views
Public dataset for news articles with their associated categories for multilabel data classification
I am wondering if there are any public datasets of news articles, from The New York Times (NYT) or similar, labeled with news categories such as politics, entertainment, lifestyle, general news, sports, etc. I ...
3 votes
1 answer
10k views
How to determine sample rate of a time series dataset?
I have a dataset of magnetometer sensor readings which looks like: ...
1 vote
0 answers
34 views
What is important for Pharmaceutical companies to answer with Big Data Analysis?
I am a data scientist, and I have some biological background (genetics). I have been asked to give a talk for our customers from the pharmaceutical industry. I should show them how they benefit from Big ...
0 votes
1 answer
25 views
Suggestion of dataset
I am implementing my own deep network, but I am not so good at calculus, so my network only works for binary data at the moment. I have been searching for big tabular datasets that are for binary ...
1 vote
1 answer
2k views
How to compute modulo of a hash?
Let's say that I have a set of users in my database, that have GUIDs as their IDs. I use xxhash to generate fixed-length hashes for each value, so that I can then ...
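As a rough illustration of the general idea in the title, a fixed-length hash can be read as an unsigned integer and then reduced with the `%` operator. This is a sketch under assumptions, not the asker's setup: it swaps in `hashlib.blake2b` from the Python standard library in place of xxhash so that it runs without extra packages, and the function name `bucket_for` and the bucket count are hypothetical.

```python
import hashlib


def bucket_for(guid: str, num_buckets: int) -> int:
    """Map a GUID string to a bucket index in [0, num_buckets).

    Assumption: blake2b with an 8-byte digest stands in for xxhash here;
    any hash that yields a fixed number of bytes can be used the same way.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    # 8 bytes of digest give a 64-bit value.
    digest = hashlib.blake2b(guid.encode("utf-8"), digest_size=8).digest()
    # Interpret the raw bytes as an unsigned big-endian integer.
    value = int.from_bytes(digest, byteorder="big")
    return value % num_buckets
```

For example, `bucket_for("123e4567-e89b-12d3-a456-426614174000", 10)` always returns the same index between 0 and 9 for that GUID, which is what makes the result usable for stable bucketing.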
2 votes
0 answers
69 views
How to deal with large datasets? [closed]
I have some experience with data science but I wanted some insight on how to deal with a very large dataset. I understand simply downloading it to your computer is not plausible so where do you even ...